This study investigates the factors influencing the price and cut quality of diamonds. By analyzing a comprehensive dataset, we aim to determine how attributes such as carat, color, clarity, depth, and table affect diamond prices. Additionally, we examine which features significantly impact the quality of a diamond’s cut. These insights can guide consumers and industry professionals in making informed decisions about diamond valuation and quality assessment.
## The Problem Description

This project examines the price and cut of diamonds using both regression and classification analysis. We divide the data into training (80%) and testing (20%) datasets. The goal of the regression models is to predict the price of a diamond from the predictor variables in the dataset; for this analysis, we look for relationships between price and the other variables using Linear Regression, a Bagged Tree, and a Random Forest. We then perform classification analysis predicting whether a diamond's cut is High or Acceptable, using Logistic Regression, Random Forest, and a Gradient Boosted model. Finally, we summarize our conclusions about which variables help predict the price and cut of a diamond.
## The Data

This dataset has 53,940 rows and 10 variables. Given its size, I have sampled 25% of the observations for better performance and speed.
## Data Sources

Kaggle: Diamonds (link: https://www.kaggle.com/datasets/shivam2503/diamonds?resource=download)
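The 25% sampling step can be sketched as follows; the file name `diamonds.csv`, the seed, and the object names are assumptions, not from the original code:

```r
# Minimal sketch, assuming the Kaggle CSV is saved locally as "diamonds.csv"
library(dplyr)
library(readr)

set.seed(123)                                  # reproducible sample
diamonds_raw <- read_csv("diamonds.csv")       # 53,940 rows
diamonds_df  <- diamonds_raw %>%
  slice_sample(prop = 0.25)                    # keep ~25% (~13,485 rows)
```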
From this data we can see that our variables take a variety of values depending on their types. carat has a mean of 0.80 but a maximum of 4.13. Several variables have a wide range of values, most noticeably price, which ranges from 336 to 18823, and width, which ranges from 0 to 9.94. There may also be high correlation between some variables, for example depth and depth_perc, so we will likely remove one of them in the analysis. For our target variable cut, the value is High if the cut is Premium or Ideal, and Acceptable if it is Fair, Good, or Very Good.
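The two-level cut target described above can be derived with a sketch like this (assuming the working data frame is named `diamonds_df`):

```r
# Collapse the five original cut grades into a binary target (sketch)
library(dplyr)

diamonds_df <- diamonds_df %>%
  mutate(cut = factor(
    if_else(cut %in% c("Premium", "Ideal"), "High", "Acceptable"),
    levels = c("High", "Acceptable")
  ))
```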
Now we can see the range of values for each variable.
```
carat cut color clarity price
Min. :0.2000 High :8873 Tier 1:4134 Tier 1:2633 Min. : 336
1st Qu.:0.4000 Acceptable:4612 Tier 2:5275 Tier 2:5070 1st Qu.: 954
Median :0.7000 Tier 3:4076 Tier 3:5782 Median : 2381
Mean :0.7952 Mean : 3904
3rd Qu.:1.0400 3rd Qu.: 5274
Max. :4.1300 Max. :18823
length width depth table_perc
Min. : 0.000 Min. :0.000 Min. :0.000 Min. :0.4300
1st Qu.: 4.710 1st Qu.:4.720 1st Qu.:2.910 1st Qu.:0.5600
Median : 5.690 Median :5.700 Median :3.520 Median :0.5700
Mean : 5.725 Mean :5.727 Mean :3.536 Mean :0.5743
3rd Qu.: 6.530 3rd Qu.:6.530 3rd Qu.:4.030 3rd Qu.:0.5900
Max. :10.020 Max. :9.940 Max. :6.430 Max. :0.9500
depth_perc
Min. :0.4300
1st Qu.:0.6110
Median :0.6190
Mean :0.6176
3rd Qu.:0.6250
Max. :0.7220
```

| cut | n | mean(price) |
|---|---|---|
| High | 8873 | 3865.20 |
| Acceptable | 4612 | 3977.29 |

| color | n | mean(price) |
|---|---|---|
| Tier 1 | 4134 | 3091.86 |
| Tier 2 | 5275 | 3871.84 |
| Tier 3 | 4076 | 4767.78 |

| clarity | n | mean(price) |
|---|---|---|
| Tier 1 | 2633 | 2894.31 |
| Tier 2 | 5070 | 3872.72 |
| Tier 3 | 5782 | 4390.13 |
We can see that about 66% of the data are categorized as ‘High’ in cut quality. Looking at the potential relationships, we can see the strongest are with carat and length.
We see the largest concentration of diamond prices around $0-$5,000; the distribution is skewed to the right. Looking at potential relationships, we see strong associations between price and carat, length, width, and depth, suggesting these variables influence the price of a diamond.
The higher-than-average correlation between certain variables (for example, width and length) may be a sign of multicollinearity, which we will address later in the analysis.
Here is a look at the training and testing datasets:

| Dataset | Number_of_Obs |
|---|---|
| Training | 10787 |
| Testing | 2698 |
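The 80/20 split above can be produced with rsample; a sketch, where the seed and object names are assumptions:

```r
library(rsample)

set.seed(123)
diamonds_split <- initial_split(diamonds_df, prop = 0.80)
diamonds_train <- training(diamonds_split)   # ~10,787 rows
diamonds_test  <- testing(diamonds_split)    # ~2,698 rows
```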
Here is a look at a linear regression model predicting price:

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 10637.883 | 1125.416 | 9.452 | 0.000 |
| carat | 10138.990 | 125.123 | 81.032 | 0.000 |
| cutAcceptable | -180.678 | 29.306 | -6.165 | 0.000 |
| colorTier 2 | -250.018 | 30.609 | -8.168 | 0.000 |
| colorTier 3 | -1139.673 | 33.367 | -34.156 | 0.000 |
| clarityTier 2 | -624.855 | 35.696 | -17.505 | 0.000 |
| clarityTier 3 | -1749.819 | 36.987 | -47.308 | 0.000 |
| length | -1396.654 | 182.366 | -7.659 | 0.000 |
| width | 630.738 | 158.448 | 3.981 | 0.000 |
| depth | 120.892 | 198.791 | 0.608 | 0.543 |
| table_perc | -5197.107 | 658.126 | -7.897 | 0.000 |
| depth_perc | -10311.035 | 1510.877 | -6.825 | 0.000 |

| Model | RMSE | RSquare | MAE |
|---|---|---|---|
| Linear Regression | 1251.52 | 0.902 | 818.497 |
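A sketch of how this baseline model and its test-set metrics could be produced with tidymodels (object names are assumptions):

```r
library(tidymodels)

lm_fit <- linear_reg() %>%
  set_engine("lm") %>%
  fit(price ~ ., data = diamonds_train)

# Test-set RMSE, R-squared, and MAE
lm_preds <- predict(lm_fit, new_data = diamonds_test) %>%
  bind_cols(diamonds_test)
metrics(lm_preds, truth = price, estimate = .pred)
```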
Here is a look at a logistic regression model predicting diamond cut:

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | -42.594 | 5.919 | -7.196 | 0.000 |
| carat | 0.652 | 0.329 | 1.986 | 0.047 |
| colorTier 2 | -0.116 | 0.057 | -2.015 | 0.044 |
| colorTier 3 | -0.266 | 0.066 | -4.019 | 0.000 |
| clarityTier 2 | -0.013 | 0.069 | -0.195 | 0.846 |
| clarityTier 3 | 0.110 | 0.077 | 1.424 | 0.154 |
| price | 0.000 | 0.000 | -7.207 | 0.000 |
| length | -16.123 | 0.719 | -22.411 | 0.000 |
| width | 11.907 | 0.688 | 17.303 | 0.000 |
| depth | 7.238 | 1.587 | 4.561 | 0.000 |
| table_perc | 47.349 | 1.322 | 35.820 | 0.000 |
| depth_perc | 21.403 | 9.427 | 2.270 | 0.023 |

| Model | Accuracy | Sensitivity | Specificity | Precision | AUC |
|---|---|---|---|---|---|
| Logistic Regression | 0.764 | 0.9 | 0.502 | 0.776 | 0.795 |

| Best_Cutoff | Sensitivity | Specificity | AUC |
|---|---|---|---|
| 0.632 | 0.787 | 0.672 | 0.795 |
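One way to locate a best cutoff is to scan the ROC curve for the threshold maximizing Youden's J (sensitivity + specificity - 1); a sketch, assuming `logit_preds` holds the predicted probability of the High class:

```r
library(dplyr)
library(yardstick)

# roc_curve returns one row per candidate threshold
roc_df <- roc_curve(logit_preds, truth = cut, .pred_High)

best_cutoff <- roc_df %>%
  mutate(j = sensitivity + specificity - 1) %>%  # Youden's J
  slice_max(j, n = 1)
```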
The coefficient for depth is not statistically significant (p = 0.543), so we will prune it from the model.

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 10232.058 | 906.195 | 11.291 | 0 |
| carat | 10141.535 | 125.049 | 81.100 | 0 |
| cutAcceptable | -179.319 | 29.220 | -6.137 | 0 |
| colorTier 2 | -250.021 | 30.609 | -8.168 | 0 |
| colorTier 3 | -1139.869 | 33.365 | -34.164 | 0 |
| clarityTier 2 | -625.028 | 35.694 | -17.511 | 0 |
| clarityTier 3 | -1750.403 | 36.974 | -47.342 | 0 |
| length | -1336.464 | 153.167 | -8.726 | 0 |
| width | 643.988 | 156.938 | 4.103 | 0 |
| table_perc | -5218.762 | 657.142 | -7.942 | 0 |
| depth_perc | -9625.903 | 1006.697 | -9.562 | 0 |

| Model | RMSE | RSquare | MAE |
|---|---|---|---|
| Linear Regression | 1251.520 | 0.902 | 818.497 |
| Linear Reg., Prune Depth | 1251.294 | 0.902 | 818.318 |
Next, we refit the model on log(price), because the price column is skewed to the right.

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | -2.481 | 0.136 | -18.258 | 0 |
| carat | -0.860 | 0.019 | -45.874 | 0 |
| cutAcceptable | -0.058 | 0.004 | -13.195 | 0 |
| colorTier 2 | -0.084 | 0.005 | -18.301 | 0 |
| colorTier 3 | -0.283 | 0.005 | -56.652 | 0 |
| clarityTier 2 | -0.205 | 0.005 | -38.314 | 0 |
| clarityTier 3 | -0.441 | 0.006 | -79.483 | 0 |
| length | 0.215 | 0.023 | 9.366 | 0 |
| width | 1.094 | 0.024 | 46.486 | 0 |
| table_perc | 0.905 | 0.099 | 9.184 | 0 |
| depth_perc | 5.399 | 0.151 | 35.768 | 0 |

| Model | RMSE | RSquare | MAE |
|---|---|---|---|
| Linear Regression | 1251.520 | 0.902 | 818.497 |
| Linear Reg., Prune Depth | 1251.294 | 0.902 | 818.318 |
| Linear Reg., Log Price | 5547.157 | 0.781 | 3854.140 |
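The log-price model can be fit and then scored back on the dollar scale, so its RMSE and MAE are comparable to the other rows; a sketch, with object names assumed:

```r
# Fit on log(price); exponentiate predictions before scoring (sketch)
library(tidymodels)

log_fit <- linear_reg() %>%
  fit(log(price) ~ . - price, data = diamonds_train)

log_preds <- predict(log_fit, new_data = diamonds_test) %>%
  mutate(.pred = exp(.pred)) %>%        # back-transform to dollars
  bind_cols(diamonds_test)
metrics(log_preds, truth = price, estimate = .pred)
```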
clarityTier 2 is not statistically significant (p = 0.846 in the model above), so we will prune it from the model.

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | -42.558 | 5.915 | -7.195 | 0.000 |
| carat | 0.650 | 0.328 | 1.978 | 0.048 |
| colorTier 2 | -0.115 | 0.057 | -2.006 | 0.045 |
| colorTier 3 | -0.265 | 0.066 | -4.018 | 0.000 |
| clarityTier 3 | 0.120 | 0.055 | 2.180 | 0.029 |
| price | 0.000 | 0.000 | -7.271 | 0.000 |
| length | -16.126 | 0.719 | -22.422 | 0.000 |
| width | 11.905 | 0.688 | 17.305 | 0.000 |
| depth | 7.243 | 1.586 | 4.566 | 0.000 |
| table_perc | 47.337 | 1.320 | 35.855 | 0.000 |
| depth_perc | 21.354 | 9.421 | 2.267 | 0.023 |

| Model | Accuracy | Sensitivity | Specificity | Precision | AUC |
|---|---|---|---|---|---|
| Logistic Regression | 0.764 | 0.900 | 0.502 | 0.776 | 0.795 |
| Logistic Reg. - Prune Clarity | 0.762 | 0.899 | 0.499 | 0.775 | 0.795 |
Next, we prune carat, whose coefficient is only marginally significant (p = 0.048).

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | -41.757 | 5.739 | -7.276 | 0.000 |
| colorTier 2 | -0.111 | 0.057 | -1.947 | 0.052 |
| colorTier 3 | -0.229 | 0.063 | -3.612 | 0.000 |
| clarityTier 3 | 0.146 | 0.053 | 2.736 | 0.006 |
| price | 0.000 | 0.000 | -8.025 | 0.000 |
| length | -16.086 | 0.711 | -22.636 | 0.000 |
| width | 11.679 | 0.668 | 17.481 | 0.000 |
| depth | 7.836 | 1.519 | 5.157 | 0.000 |
| table_perc | 47.578 | 1.315 | 36.175 | 0.000 |
| depth_perc | 18.810 | 9.079 | 2.072 | 0.038 |

| Model | Accuracy | Sensitivity | Specificity | Precision | AUC |
|---|---|---|---|---|---|
| Logistic Regression | 0.764 | 0.900 | 0.502 | 0.776 | 0.795 |
| Logistic Reg. - Prune Clarity | 0.762 | 0.899 | 0.499 | 0.775 | 0.795 |
| Logistic Reg. - Prune Carat | 0.763 | 0.901 | 0.497 | 0.775 | 0.794 |
Next, we prune width; although its coefficient is statistically significant, width is highly correlated with length, the multicollinearity concern noted earlier.

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 25.393 | 4.208 | 6.034 | 0.000 |
| colorTier 2 | -0.117 | 0.056 | -2.090 | 0.037 |
| colorTier 3 | -0.223 | 0.062 | -3.594 | 0.000 |
| clarityTier 3 | 0.137 | 0.052 | 2.611 | 0.009 |
| price | 0.000 | 0.000 | -6.653 | 0.000 |
| length | -15.069 | 0.734 | -20.517 | 0.000 |
| depth | 24.919 | 1.195 | 20.851 | 0.000 |
| table_perc | 43.532 | 1.258 | 34.612 | 0.000 |
| depth_perc | -85.174 | 6.822 | -12.485 | 0.000 |

| Model | Accuracy | Sensitivity | Specificity | Precision | AUC |
|---|---|---|---|---|---|
| Logistic Regression | 0.764 | 0.900 | 0.502 | 0.776 | 0.795 |
| Logistic Reg. - Prune Clarity | 0.762 | 0.899 | 0.499 | 0.775 | 0.795 |
| Logistic Reg. - Prune Carat | 0.763 | 0.901 | 0.497 | 0.775 | 0.794 |
| Logistic Reg. - Prune Width | 0.750 | 0.903 | 0.456 | 0.761 | 0.767 |
```
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Formula
Model: rand_forest()
── Preprocessor ────────────────────────────────────────────────────────────────
price ~ .
── Model ───────────────────────────────────────────────────────────────────────
Random Forest Model Specification (regression)
Main Arguments:
mtry = .preds()
trees = 500
min_n = 5
Engine-Specific Arguments:
importance = impurity
max.depth = 8
Computational engine: ranger
```

| Model | RMSE | RSquare | MAE |
|---|---|---|---|
| Linear Regression | 1251.520 | 0.902 | 818.497 |
| Linear Reg., Prune Depth | 1251.294 | 0.902 | 818.318 |
| Linear Reg., Log Price | 5547.157 | 0.781 | 3854.140 |
| Tuned Bagged Model | 853.026 | 0.954 | 455.336 |
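The bagged model in the printout above is expressed as a random forest that considers every predictor at each split (`mtry = .preds()`); a sketch of the spec, with object names assumed:

```r
library(tidymodels)

bag_spec <- rand_forest(mtry = .preds(), trees = 500, min_n = 5) %>%
  set_engine("ranger", importance = "impurity", max.depth = 8) %>%
  set_mode("regression")

bag_fit <- workflow() %>%
  add_formula(price ~ .) %>%
  add_model(bag_spec) %>%
  fit(data = diamonds_train)
```

Bagging is the special case of a random forest where no predictors are excluded from any split, which is why the same `rand_forest()` spec serves both models.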
```
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Formula
Model: rand_forest()
── Preprocessor ────────────────────────────────────────────────────────────────
price ~ .
── Model ───────────────────────────────────────────────────────────────────────
Random Forest Model Specification (regression)
Main Arguments:
mtry = 7
trees = 500
min_n = 5
Engine-Specific Arguments:
importance = impurity
max.depth = 8
Computational engine: ranger
```

| Model | RMSE | RSquare | MAE |
|---|---|---|---|
| Linear Regression | 1251.520 | 0.902 | 818.497 |
| Linear Reg., Prune Depth | 1251.294 | 0.902 | 818.318 |
| Linear Reg., Log Price | 5547.157 | 0.781 | 3854.140 |
| Tuned Bagged Model | 853.026 | 0.954 | 455.336 |
| Tuned Random Forest Model | 851.051 | 0.955 | 453.298 |
For the classification random forest, we set the number of predictors sampled at each split (mtry) to 5, the number of trees (trees) to 500, the minimum node size (min_n) to 15, and the maximum tree depth (max.depth) to 7.

```
Ranger result
Call:
ranger::ranger(x = maybe_data_frame(x), y = y, mtry = min_cols(~5, x), num.trees = ~500, min.node.size = min_rows(~15, x), importance = ~"impurity", max.depth = ~7, num.threads = 1, verbose = FALSE, seed = sample.int(10^5, 1), probability = TRUE)
Type: Probability estimation
Number of trees: 500
Sample size: 10787
Number of independent variables: 9
Mtry: 5
Target node size: 15
Variable importance mode: impurity
Splitrule: gini
OOB prediction error (Brier s.): 0.1216869
```

| Model | Accuracy | Sensitivity | Specificity | Precision | AUC |
|---|---|---|---|---|---|
| Logistic Regression | 0.764 | 0.900 | 0.502 | 0.776 | 0.795 |
| Logistic Reg. - Prune Clarity | 0.762 | 0.899 | 0.499 | 0.775 | 0.795 |
| Logistic Reg. - Prune Carat | 0.763 | 0.901 | 0.497 | 0.775 | 0.794 |
| Logistic Reg. - Prune Width | 0.750 | 0.903 | 0.456 | 0.761 | 0.767 |
| Classification Random Forest | 0.845 | 0.970 | 0.605 | 0.825 | 0.878 |

| Best_Cutoff | Sensitivity | Specificity | AUC |
|---|---|---|---|
| 0.668 | 0.895 | 0.724 | 0.878 |

| Model | Accuracy | Sensitivity | Specificity | Precision | AUC |
|---|---|---|---|---|---|
| Logistic Regression | 0.764 | 0.900 | 0.502 | 0.776 | 0.795 |
| Logistic Reg. - Prune Clarity | 0.762 | 0.899 | 0.499 | 0.775 | 0.795 |
| Logistic Reg. - Prune Carat | 0.763 | 0.901 | 0.497 | 0.775 | 0.794 |
| Logistic Reg. - Prune Width | 0.750 | 0.903 | 0.456 | 0.761 | 0.767 |
| Classification Random Forest | 0.845 | 0.970 | 0.605 | 0.825 | 0.878 |
| Classification Random Forest with Cutoff | 0.837 | 0.895 | 0.724 | 0.862 | 0.878 |
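Applying the tuned cutoff amounts to re-thresholding the class probabilities; a sketch, where `rf_preds` holding a `.pred_High` probability column is an assumption:

```r
library(dplyr)

# Classify as "High" when the predicted probability clears the
# tuned 0.668 cutoff instead of the default 0.50
rf_preds <- rf_preds %>%
  mutate(.pred_cutoff = factor(
    if_else(.pred_High >= 0.668, "High", "Acceptable"),
    levels = c("High", "Acceptable")
  ))
```

Raising the cutoff trades sensitivity for specificity, which matches the shift seen in the table above.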
class_xg_grid contains a tibble with 10 rows, each row a unique combination of the specified hyperparameters.

```
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Formula
Model: boost_tree()
── Preprocessor ────────────────────────────────────────────────────────────────
cut ~ .
── Model ───────────────────────────────────────────────────────────────────────
Boosted Tree Model Specification (classification)
Main Arguments:
mtry = 5
trees = 500
min_n = 15
tree_depth = 11
learn_rate = 0.00339794601607989
loss_reduction = 2.21670407129474e-10
Computational engine: xgboost
```

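A grid like `class_xg_grid` can be built with dials; a sketch tuning the three parameters that vary in the spec above (the seed is an assumption):

```r
library(dials)

set.seed(123)
class_xg_grid <- grid_random(
  tree_depth(),
  learn_rate(),
  loss_reduction(),
  size = 10          # 10 random hyperparameter combinations
)
```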
| Model | Accuracy | Sensitivity | Specificity | Precision | AUC |
|---|---|---|---|---|---|
| Logistic Regression | 0.764 | 0.900 | 0.502 | 0.776 | 0.795 |
| Logistic Reg. - Prune Clarity | 0.762 | 0.899 | 0.499 | 0.775 | 0.795 |
| Logistic Reg. - Prune Carat | 0.763 | 0.901 | 0.497 | 0.775 | 0.794 |
| Logistic Reg. - Prune Width | 0.750 | 0.903 | 0.456 | 0.761 | 0.767 |
| Classification Random Forest | 0.845 | 0.970 | 0.605 | 0.825 | 0.878 |
| Classification Random Forest with Cutoff | 0.837 | 0.895 | 0.724 | 0.862 | 0.878 |
| Classification Gradient Boosted | 0.854 | 0.972 | 0.626 | 0.833 | 0.895 |

| Best_Cutoff | Sensitivity | Specificity | AUC |
|---|---|---|---|
| 0.646 | 0.86 | 0.776 | 0.895 |

| Model | Accuracy | Sensitivity | Specificity | Precision | AUC |
|---|---|---|---|---|---|
| Logistic Regression | 0.764 | 0.900 | 0.502 | 0.776 | 0.795 |
| Logistic Reg. - Prune Clarity | 0.762 | 0.899 | 0.499 | 0.775 | 0.795 |
| Logistic Reg. - Prune Carat | 0.763 | 0.901 | 0.497 | 0.775 | 0.794 |
| Logistic Reg. - Prune Width | 0.750 | 0.903 | 0.456 | 0.761 | 0.767 |
| Classification Random Forest | 0.845 | 0.970 | 0.605 | 0.825 | 0.878 |
| Classification Random Forest with Cutoff | 0.837 | 0.895 | 0.724 | 0.862 | 0.878 |
| Classification Gradient Boosted | 0.854 | 0.972 | 0.626 | 0.833 | 0.895 |
| Classification Gradient Boosted with Cutoff | 0.831 | 0.860 | 0.776 | 0.881 | 0.895 |
To predict price, here are the most important variables:
For the price, how big, bright, and clear the diamond is all have a big impact!
To predict cut, here are the most important variables:
We can see that the different measurements of the diamond affect the cut of the diamond.
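The importance scores behind these rankings can be drawn from the fitted ranger models; a sketch, where the `vip` package and the object name `rf_class_fit` are assumptions:

```r
library(vip)
library(tidymodels)

rf_class_fit %>%            # fitted classification workflow
  extract_fit_parsnip() %>%
  vip(num_features = 10)    # plot top-10 impurity-based importances
```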
For price prediction, the Tuned Random Forest model performs best, with the lowest RMSE and MAE:

| Model | RMSE | RSquare | MAE |
|---|---|---|---|
| Linear Regression | 1251.52 | 0.90 | 818.50 |
| Linear Reg., Prune Depth | 1251.29 | 0.90 | 818.32 |
| Linear Reg., Log Price | 5547.16 | 0.78 | 3854.14 |
| Tuned Bagged Model | 853.03 | 0.95 | 455.34 |
| Tuned Random Forest Model | 851.05 | 0.95 | 453.30 |
For cut prediction, I would recommend the Gradient Boosted model due to its high sensitivity, precision, and AUC. The model was able to catch 86% to 97% of the true positive values, and of the positive values it predicted, 83% to 88% were true (depending on whether the best cutoff is used).

| Model | Accuracy | Sensitivity | Specificity | Precision | AUC |
|---|---|---|---|---|---|
| Logistic Regression | 0.76 | 0.90 | 0.50 | 0.78 | 0.80 |
| Logistic Reg. - Prune Clarity | 0.76 | 0.90 | 0.50 | 0.78 | 0.80 |
| Logistic Reg. - Prune Carat | 0.76 | 0.90 | 0.50 | 0.78 | 0.79 |
| Logistic Reg. - Prune Width | 0.75 | 0.90 | 0.46 | 0.76 | 0.77 |
| Classification Random Forest | 0.84 | 0.97 | 0.60 | 0.83 | 0.88 |
| Classification Random Forest with Cutoff | 0.84 | 0.90 | 0.72 | 0.86 | 0.88 |
| Classification Gradient Boosted | 0.85 | 0.97 | 0.63 | 0.83 | 0.90 |
| Classification Gradient Boosted with Cutoff | 0.83 | 0.86 | 0.78 | 0.88 | 0.90 |
One challenge was the number of categories in clarity, color, and cut; it was difficult to determine how to re-categorize them so that each column has only about 2 to 3 categories. If I had another week to work on this project, I would love to try more of the models we covered in class, such as Lasso and SVC. I would also try to normalize and balance the dataset, as we did in homework cases #2 and #3, to see whether transforming the data has any impact on the performance of the models.